Implementation for Bankruptcy Data Science Challenge

Loading the data

The dataset has 12,034 rows and 14 columns.

Creating Target labels based on matched and feature transaction ids

The dataset is highly imbalanced: only 857 rows are correct matches.
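A minimal sketch of deriving the binary target. The column names `matched_transaction_id` and `feature_transaction_id` are assumptions standing in for the actual schema; the toy frame is synthetic.

```python
import pandas as pd

# Hypothetical column names -- the real dataset schema may differ.
df = pd.DataFrame({
    "matched_transaction_id": [101, 102, 103, 104],
    "feature_transaction_id": [101, 999, 103, 888],
})

# Label a row 1 (correct match) when the two transaction ids agree, else 0.
df["target"] = (df["matched_transaction_id"] == df["feature_transaction_id"]).astype(int)
print(df["target"].tolist())  # → [1, 0, 1, 0]
```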

EDA - Data Visualization

Imbalanced Data

Plot a correlation map for all numeric variables
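One way to plot such a map, sketched on synthetic data (the feature names `f1`..`f4` are placeholders, not columns from the challenge dataset):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, safe outside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

corr = df.corr()  # pairwise Pearson correlations of the numeric columns
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig("corr_map.png")
```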

Balance the data set using SMOTE

DateMappingMatch is 0 for more than 8k samples.

Evaluation Metric

- Recall is the metric to optimize here: the higher the recall, the better the model.
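Recall is the fraction of true positives among all actual positives, TP / (TP + FN). A small worked example with made-up labels:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

# 3 of the 4 actual positives are caught: recall = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))  # → 0.75
```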

Baseline Models

Using cross-validation to find the best-performing model
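A sketch of the comparison loop on synthetic data; the candidate models shown (logistic regression, KNN, decision tree) are assumptions about which baselines were tried:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=500, random_state=42)
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=42),
}

# Stratified folds preserve the class ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="recall", cv=cv)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```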

Fine-tuning KNN

Tune scaled KNN

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, GridSearchCV

num_folds = 10     # assumed; not defined in the original snippet
scoring = "recall" # assumed from the evaluation-metric section

scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
neighbors = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
param_grid = dict(n_neighbors=neighbors)
model = KNeighborsClassifier()
kfold = KFold(n_splits=num_folds)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(rescaledX, Y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

USE OF XGBOOST

Convert the dataset into XGBoost's optimized internal data structure, DMatrix, which is what gives the library its acclaimed performance and efficiency gains.

FEATURE IMPORTANCE USING XGBOOST

Improving XGBOOST

Future steps:

Conclusion: